Dataset statistics
| Number of variables | 4 |
|---|---|
| Number of observations | 1000 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 31.4 KiB |
| Average record size in memory | 32.1 B |
Variable types
| Numeric | 2 |
|---|---|
| Categorical | 2 |
birthDate has a high cardinality: 516 distinct values | High cardinality |
nationality has a high cardinality: 54 distinct values | High cardinality |
birthDate is uniformly distributed | Uniform |
df_index has unique values | Unique |
statementID has unique values | Unique |
Reproduction
| Analysis started | 2022-06-01 21:25:11.553460 |
|---|---|
| Analysis finished | 2022-06-01 21:25:28.728519 |
| Duration | 17.18 seconds |
| Software version | pandas-profiling v3.2.0 |
| Download configuration | config.json |
df_index
| Distinct | 1000 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 2623.861 |
| Minimum | 0 |
|---|---|
| Maximum | 5374 |
| Zeros | 1 |
| Zeros (%) | 0.1% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 7.9 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 270.95 |
| Q1 | 1319.75 |
| median | 2564 |
| Q3 | 3944.75 |
| 95-th percentile | 5093.25 |
| Maximum | 5374 |
| Range | 5374 |
| Interquartile range (IQR) | 2625 |
Descriptive statistics
| Standard deviation | 1523.952175 |
|---|---|
| Coefficient of variation (CV) | 0.5808052237 |
| Kurtosis | -1.165308102 |
| Mean | 2623.861 |
| Median Absolute Deviation (MAD) | 1335 |
| Skewness | 0.07968661155 |
| Sum | 2623861 |
| Variance | 2322430.232 |
| Monotonicity | Not monotonic |
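Several of the figures above are derived from others in the same tables, so they can be cross-checked by hand. The short sketch below recomputes the range, IQR, coefficient of variation, variance and sum from the reported minimum, maximum, quartiles, mean and standard deviation. The variable names are illustrative only, and the row count of 1000 is taken from the dataset statistics.

```python
# Values copied from the df_index tables in this report.
minimum, maximum = 0, 5374
q1, q3 = 1319.75, 3944.75
mean, std = 2623.861, 1523.952175
n = 1000  # number of observations

value_range = maximum - minimum  # Range
iqr = q3 - q1                    # Interquartile range (IQR)
cv = std / mean                  # Coefficient of variation (CV)
variance = std ** 2              # Variance
total = mean * n                 # Sum

print(value_range, iqr, round(cv, 6), round(variance, 1), round(total))
```

The recomputed values agree with the report to within the rounding of the printed digits.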
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 1795 | 1 | 0.1% |
| 2788 | 1 | 0.1% |
| 1361 | 1 | 0.1% |
| 3677 | 1 | 0.1% |
| 3732 | 1 | 0.1% |
| 3896 | 1 | 0.1% |
| 2056 | 1 | 0.1% |
| 1133 | 1 | 0.1% |
| 2937 | 1 | 0.1% |
| 4609 | 1 | 0.1% |
| Other values (990) | 990 | 99.0% |
| Value | Count | Frequency (%) |
| 0 | 1 | 0.1% |
| 8 | 1 | 0.1% |
| 11 | 1 | 0.1% |
| 13 | 1 | 0.1% |
| 23 | 1 | 0.1% |
| 27 | 1 | 0.1% |
| 32 | 1 | 0.1% |
| 39 | 1 | 0.1% |
| 47 | 1 | 0.1% |
| 48 | 1 | 0.1% |
| Value | Count | Frequency (%) |
| 5374 | 1 | 0.1% |
| 5365 | 1 | 0.1% |
| 5348 | 1 | 0.1% |
| 5338 | 1 | 0.1% |
| 5336 | 1 | 0.1% |
| 5322 | 1 | 0.1% |
| 5305 | 1 | 0.1% |
| 5303 | 1 | 0.1% |
| 5298 | 1 | 0.1% |
| 5297 | 1 | 0.1% |
statementID
| Distinct | 1000 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 9.391612659 × 10^18 |
| Minimum | 2.261760228 × 10^16 |
|---|---|
| Maximum | 1.843121297 × 10^19 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 7.9 KiB |
Quantile statistics
| Minimum | 2.261760228 × 10^16 |
|---|---|
| 5-th percentile | 1.077589261 × 10^18 |
| Q1 | 4.859557379 × 10^18 |
| median | 9.457167358 × 10^18 |
| Q3 | 1.391643545 × 10^19 |
| 95-th percentile | 1.740639106 × 10^19 |
| Maximum | 1.843121297 × 10^19 |
| Range | 1.840859537 × 10^19 |
| Interquartile range (IQR) | 9.056878066 × 10^18 |
Descriptive statistics
| Standard deviation | 5.269885424 × 10^18 |
|---|---|
| Coefficient of variation (CV) | 0.5611267857 |
| Kurtosis | -1.183598067 |
| Mean | 9.391612659 × 10^18 |
| Median Absolute Deviation (MAD) | 4.546394439 × 10^18 |
| Skewness | -0.03695536229 |
| Sum | 9.391612659 × 10^21 |
| Variance | 2.777169238 × 10^37 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 1.894954634 × 10^18 | 1 | 0.1% |
| 1.301353082 × 10^19 | 1 | 0.1% |
| 1.808278939 × 10^18 | 1 | 0.1% |
| 2.879329166 × 10^17 | 1 | 0.1% |
| 1.361266249 × 10^19 | 1 | 0.1% |
| 1.676381301 × 10^19 | 1 | 0.1% |
| 1.368662257 × 10^18 | 1 | 0.1% |
| 6.747802971 × 10^18 | 1 | 0.1% |
| 4.111929883 × 10^18 | 1 | 0.1% |
| 1.060704421 × 10^19 | 1 | 0.1% |
| Other values (990) | 990 | 99.0% |
| Value | Count | Frequency (%) |
| 2.261760228 × 10^16 | 1 | 0.1% |
| 3.087674609 × 10^16 | 1 | 0.1% |
| 4.134186759 × 10^16 | 1 | 0.1% |
| 4.897217398 × 10^16 | 1 | 0.1% |
| 6.045753032 × 10^16 | 1 | 0.1% |
| 7.747382497 × 10^16 | 1 | 0.1% |
| 9.664508252 × 10^16 | 1 | 0.1% |
| 1.039521447 × 10^17 | 1 | 0.1% |
| 1.138621581 × 10^17 | 1 | 0.1% |
| 1.338930217 × 10^17 | 1 | 0.1% |
| Value | Count | Frequency (%) |
| 1.843121297 × 10^19 | 1 | 0.1% |
| 1.841922293 × 10^19 | 1 | 0.1% |
| 1.841212592 × 10^19 | 1 | 0.1% |
| 1.839990486 × 10^19 | 1 | 0.1% |
| 1.839989714 × 10^19 | 1 | 0.1% |
| 1.839910485 × 10^19 | 1 | 0.1% |
| 1.839329021 × 10^19 | 1 | 0.1% |
| 1.829387604 × 10^19 | 1 | 0.1% |
| 1.828284347 × 10^19 | 1 | 0.1% |
| 1.828166149 × 10^19 | 1 | 0.1% |
birthDate
| Distinct | 516 |
|---|---|
| Distinct (%) | 51.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 7.9 KiB |
| 1985-01-01 | 6 |
|---|---|
| 1987-08-01 | 6 |
| 1961-08-01 | 5 |
| 1991-04-01 | 5 |
| 1981-02-01 | 5 |
| Other values (511) | 973 |
Length
| Max length | 10 |
|---|---|
| Median length | 10 |
| Mean length | 10 |
| Min length | 10 |
Characters and Unicode
| Total characters | 10000 |
|---|---|
| Distinct characters | 11 |
| Distinct categories | 2 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
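As a rough illustration of how such character properties can be read, the sketch below uses Python's standard `unicodedata` module to count general categories and individual characters. The input is just the five sample rows of birthDate shown in this report, not the full column, and the variable names are illustrative. It is not the code the report generator runs.

```python
import unicodedata
from collections import Counter

# Five sample rows of birthDate taken from this report.
samples = ["1977-11-01", "1976-04-01", "1944-08-01", "1983-03-01", "1975-11-01"]

# unicodedata.category returns a two-letter general category code,
# e.g. "Nd" (Decimal Number) or "Pd" (Dash Punctuation).
category_counts = Counter(unicodedata.category(ch) for s in samples for ch in s)

# Plain per-character counts, the basis of a "most occurring characters" table.
char_counts = Counter(ch for s in samples for ch in s)

print(category_counts)
print(char_counts.most_common(3))
```

Each date has eight digits and two hyphens, which is why the full column of 1000 values splits into 8000 Decimal Number and 2000 Dash Punctuation characters.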
Unique
| Unique | 224 |
|---|---|
| Unique (%) | 22.4% |
Sample
| 1st row | 1977-11-01 |
|---|---|
| 2nd row | 1976-04-01 |
| 3rd row | 1944-08-01 |
| 4th row | 1983-03-01 |
| 5th row | 1975-11-01 |
Common Values
| Value | Count | Frequency (%) |
| 1985-01-01 | 6 | 0.6% |
| 1987-08-01 | 6 | 0.6% |
| 1961-08-01 | 5 | 0.5% |
| 1991-04-01 | 5 | 0.5% |
| 1981-02-01 | 5 | 0.5% |
| 1965-03-01 | 5 | 0.5% |
| 1976-03-01 | 5 | 0.5% |
| 1968-06-01 | 5 | 0.5% |
| 1986-08-01 | 5 | 0.5% |
| 1981-12-01 | 5 | 0.5% |
| Other values (506) | 948 |
Length
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| 1985-01-01 | 6 | 0.6% |
| 1987-08-01 | 6 | 0.6% |
| 1981-12-01 | 5 | 0.5% |
| 1975-01-01 | 5 | 0.5% |
| 1985-09-01 | 5 | 0.5% |
| 1974-01-01 | 5 | 0.5% |
| 1980-12-01 | 5 | 0.5% |
| 1976-01-01 | 5 | 0.5% |
| 1978-06-01 | 5 | 0.5% |
| 1985-04-01 | 5 | 0.5% |
| Other values (506) | 948 |
Most occurring characters
| Value | Count | Frequency (%) |
| 1 | 2509 | |
| - | 2000 | |
| 0 | 1976 | |
| 9 | 1272 | |
| 7 | 456 | 4.6% |
| 8 | 426 | 4.3% |
| 6 | 369 | 3.7% |
| 5 | 312 | 3.1% |
| 2 | 272 | 2.7% |
| 4 | 229 | 2.3% |
Most occurring categories
| Value | Count | Frequency (%) |
| Decimal Number | 8000 | |
| Dash Punctuation | 2000 | 20.0% |
Most frequent character per category
Decimal Number
| Value | Count | Frequency (%) |
| 1 | 2509 | |
| 0 | 1976 | |
| 9 | 1272 | |
| 7 | 456 | 5.7% |
| 8 | 426 | 5.3% |
| 6 | 369 | 4.6% |
| 5 | 312 | 3.9% |
| 2 | 272 | 3.4% |
| 4 | 229 | 2.9% |
| 3 | 179 | 2.2% |
Dash Punctuation
| Value | Count | Frequency (%) |
| - | 2000 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Common | 10000 |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
| 1 | 2509 | |
| - | 2000 | |
| 0 | 1976 | |
| 9 | 1272 | |
| 7 | 456 | 4.6% |
| 8 | 426 | 4.3% |
| 6 | 369 | 3.7% |
| 5 | 312 | 3.1% |
| 2 | 272 | 2.7% |
| 4 | 229 | 2.3% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 10000 |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| 1 | 2509 | |
| - | 2000 | |
| 0 | 1976 | |
| 9 | 1272 | |
| 7 | 456 | 4.6% |
| 8 | 426 | 4.3% |
| 6 | 369 | 3.7% |
| 5 | 312 | 3.1% |
| 2 | 272 | 2.7% |
| 4 | 229 | 2.3% |
nationality
| Distinct | 54 |
|---|---|
| Distinct (%) | 5.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 7.9 KiB |
| GB | 843 |
|---|---|
| RO | 16 |
| IE | 14 |
| PH | 14 |
| PK | 12 |
| Other values (49) | 101 |
Length
| Max length | 2 |
|---|---|
| Median length | 2 |
| Mean length | 2 |
| Min length | 2 |
Characters and Unicode
| Total characters | 2000 |
|---|---|
| Distinct characters | 25 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 29 |
|---|---|
| Unique (%) | 2.9% |
Sample
| 1st row | GB |
|---|---|
| 2nd row | GB |
| 3rd row | GB |
| 4th row | GB |
| 5th row | GB |
Common Values
| Value | Count | Frequency (%) |
| GB | 843 | 84.3% |
| RO | 16 | 1.6% |
| IE | 14 | 1.4% |
| PH | 14 | 1.4% |
| PK | 12 | 1.2% |
| DE | 9 | 0.9% |
| ES | 8 | 0.8% |
| PL | 7 | 0.7% |
| PT | 7 | 0.7% |
| TR | 6 | 0.6% |
| Other values (44) | 64 | 6.4% |
Length
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| gb | 843 | |
| ro | 16 | 1.6% |
| ie | 14 | 1.4% |
| ph | 14 | 1.4% |
| pk | 12 | 1.2% |
| de | 9 | 0.9% |
| es | 8 | 0.8% |
| pl | 7 | 0.7% |
| pt | 7 | 0.7% |
| tr | 6 | 0.6% |
| Other values (44) | 64 | 6.4% |
Most occurring characters
| Value | Count | Frequency (%) |
| G | 851 | |
| B | 851 | |
| P | 40 | 2.0% |
| E | 37 | 1.8% |
| R | 27 | 1.4% |
| T | 22 | 1.1% |
| I | 20 | 1.0% |
| K | 18 | 0.9% |
| H | 18 | 0.9% |
| L | 17 | 0.9% |
| Other values (15) | 99 | 5.0% |
Most occurring categories
| Value | Count | Frequency (%) |
| Uppercase Letter | 2000 |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
| G | 851 | |
| B | 851 | |
| P | 40 | 2.0% |
| E | 37 | 1.8% |
| R | 27 | 1.4% |
| T | 22 | 1.1% |
| I | 20 | 1.0% |
| K | 18 | 0.9% |
| H | 18 | 0.9% |
| L | 17 | 0.9% |
| Other values (15) | 99 | 5.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 2000 |
Most frequent character per script
Latin
| Value | Count | Frequency (%) |
| G | 851 | |
| B | 851 | |
| P | 40 | 2.0% |
| E | 37 | 1.8% |
| R | 27 | 1.4% |
| T | 22 | 1.1% |
| I | 20 | 1.0% |
| K | 18 | 0.9% |
| H | 18 | 0.9% |
| L | 17 | 0.9% |
| Other values (15) | 99 | 5.0% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 2000 |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| G | 851 | |
| B | 851 | |
| P | 40 | 2.0% |
| E | 37 | 1.8% |
| R | 27 | 1.4% |
| T | 22 | 1.1% |
| I | 20 | 1.0% |
| K | 18 | 0.9% |
| H | 18 | 0.9% |
| L | 17 | 0.9% |
| Other values (15) | 99 | 5.0% |
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and 1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
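The calculation described above can be sketched in plain Python: rank both variables, then apply the Pearson formula to the ranks. The helper names `pearson`, `ranks` and `spearman` are illustrative, and ties receive average ranks, which is a common convention assumed here rather than something stated in the report.

```python
def pearson(x, y):
    """Covariance over the product of standard deviations.
    The 1/n factors cancel, so plain sums are used."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but nonlinear

print(spearman(x, y), pearson(x, y))
```

For this cubic relationship ρ is 1 (up to floating-point rounding) because the ordering is perfectly preserved, while Pearson's r stays below 1 because the relationship is not linear.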
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and 1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
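The invariance property can be checked numerically. The sketch below uses made-up data and an illustrative `pearson` helper. It shifts and rescales both variables with positive scale factors and compares the result with the original r.

```python
def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations.
    The 1/n factors cancel, so plain sums are used."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical, roughly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(x, y)
# Change location and scale of each variable separately.
r_transformed = pearson([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])

# Exact lines with very different slopes both give r = 1.
r_shallow = pearson(x, [0.1 * a for a in x])
r_steep = pearson(x, [100 * a for a in x])

print(r, r_transformed, r_shallow, r_steep)
```

A negative scale factor on one variable would flip the sign of r but leave its magnitude unchanged.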
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and 1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
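The pair-counting definition above translates directly into a few lines of Python. This sketch implements exactly that formula, often called tau-a, with tied pairs counted as neither concordant nor discordant. Statistical libraries commonly use tie-adjusted variants such as tau-b, so their results can differ when ties are present. The function name and the data are illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables order the pair the same way
        elif sign < 0:
            discordant += 1   # the variables disagree on the order
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 5, 4]

tau = kendall_tau_a(x, y)
print(tau)
```

Of the 10 pairs in this example, 7 are concordant and 3 are discordant, giving τ = (7 - 3) / 10 = 0.4.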
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for Phik.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display which lets you quickly pick out patterns in data completion.
First rows
| | df_index | statementID | birthDate | nationality |
|---|---|---|---|---|
| 0 | 1795 | 1894954633915448875 | 1977-11-01 | GB |
| 1 | 660 | 8122201078160437121 | 1976-04-01 | GB |
| 2 | 4167 | 14897321233292332437 | 1944-08-01 | GB |
| 3 | 3872 | 4900120917638339431 | 1983-03-01 | GB |
| 4 | 1468 | 2187005604204266902 | 1975-11-01 | GB |
| 5 | 5124 | 8477491518228188624 | 1982-12-01 | GB |
| 6 | 2389 | 11940734352805019661 | 1954-12-01 | GB |
| 7 | 1836 | 14593884978314155394 | 1969-09-01 | GB |
| 8 | 784 | 16910348632076779939 | 1982-06-01 | GB |
| 9 | 768 | 10872922384409726013 | 1997-01-01 | GB |
Last rows
| | df_index | statementID | birthDate | nationality |
|---|---|---|---|---|
| 990 | 4325 | 10301834450507381579 | 1957-10-01 | GB |
| 991 | 3989 | 3623805796509913094 | 1968-09-01 | GB |
| 992 | 1242 | 7922084366984616593 | 2001-09-01 | GB |
| 993 | 4835 | 2126511552858375391 | 1968-06-01 | GB |
| 994 | 3429 | 8975710324249148371 | 2003-07-01 | LV |
| 995 | 2138 | 17596704098375131751 | 1977-01-01 | GB |
| 996 | 3783 | 14218517060360418696 | 1969-02-01 | GB |
| 997 | 1222 | 14997462062646212603 | 1990-04-01 | GB |
| 998 | 4487 | 10081875072092782811 | 1982-03-01 | GB |
| 999 | 970 | 16980272473163185766 | 1981-08-01 | RO |